Bring back agi::fs::path to ensure UTF-8 paths #231

Open
wants to merge 2 commits into master
@arch1t3cht arch1t3cht commented Dec 21, 2024

See the message for the second commit, also pasted here:


On Windows, std::filesystem::path internally stores paths in UTF-16,
but constructing an std::filesystem::path from a string reads that
string in Windows-1252 or some other non-UTF-8 narrow encoding. This
breaks all kinds of code that previously assumed that one could simply
convert between UTF-8 strings, wstrings, and paths freely.
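
To illustrate the mismatch described above, here is a minimal sketch (not
code from this changeset) that turns the same bytes into a path in two ways.
The helper names path_from_utf8, path_from_narrow and narrow_matches_utf8 are
made up for this example.

```cpp
#include <filesystem>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper: decode the bytes as UTF-8, independent of the
// locale or code page.
inline fs::path path_from_utf8(std::string_view utf8) {
#if defined(__cpp_lib_char8_t)
    // C++20: the char8_t-based constructors always treat input as UTF-8.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path is the UTF-8 entry point (deprecated in C++20).
    return fs::u8path(utf8.begin(), utf8.end());
#endif
}

// Hypothetical helper: decode the bytes with the native narrow encoding,
// which is what constructing a path from a std::string does.
inline fs::path path_from_narrow(std::string_view bytes) {
    return fs::path(std::string(bytes));
}

// True when both interpretations of the bytes produce the same path.
inline bool narrow_matches_utf8(std::string_view bytes) {
    return path_from_narrow(bytes) == path_from_utf8(bytes);
}
```

For pure ASCII input the two constructions agree everywhere. On Windows with
a legacy code page, narrow_matches_utf8 would be expected to return false for
non-ASCII input such as "\xC3\xA9.ass", which is the failure mode this PR is
about.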

Before the switch from boost::filesystem to std::filesystem, this was
solved by using boost::filesystem::path::imbue to configure
boost::filesystem to always use UTF-8. However, there is no equivalent
function for std::filesystem. It seems that the encoding used can be
controlled to some degree using the C and C++ locales, but changing
these to UTF-8 breaks other things, and global locales are a headache
in general (I won't pull a wm4 here, but you probably know what I mean).

So, there does not seem to be any easy solution to this. Aegisub also
isn't the only program to have this problem, see e.g.
https://www.bunkus.org/2021/03/converting-a-c-code-base-from-boostfilesystem-to-stdfilesystem/

As far as I can see, the three options are

  • Somehow mess with the global locales until everything magically works.
    This feels risky, might not work on all systems, and could break in
    the future.
  • Audit the entire code base and check every single conversion between
    strings and paths. (Yeah, no.)
  • Reinvent the wheel and write a wrapper class that fixes
    std::filesystem::path by forcing all conversions from and to
    std::string to use UTF-8.
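
As a rough idea of what the third option looks like, here is a minimal
sketch of such a wrapper. It is an illustration under assumptions, not the
class from fs.h in this changeset: the name utf8_path and its small interface
are hypothetical.

```cpp
#include <filesystem>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical wrapper: every conversion from or to std::string is UTF-8.
class utf8_path {
public:
    utf8_path() = default;
    utf8_path(std::string_view utf8) : p_(decode(utf8)) {}
    utf8_path(const std::string& utf8) : p_(decode(utf8)) {}
    utf8_path(const char* utf8) : p_(decode(utf8)) {}
    utf8_path(std::filesystem::path p) : p_(std::move(p)) {}

    // Always returns UTF-8 bytes, whatever the native narrow encoding is.
    std::string string() const {
        const auto u8 = p_.u8string();  // std::u8string in C++20, std::string in C++17
        return std::string(u8.begin(), u8.end());
    }

    // Access to the wrapped path for calls into std::filesystem.
    const std::filesystem::path& std_path() const noexcept { return p_; }

    utf8_path filename() const { return utf8_path(p_.filename()); }

    friend utf8_path operator/(const utf8_path& a, const utf8_path& b) {
        return utf8_path(a.p_ / b.p_);
    }
    friend bool operator==(const utf8_path& a, const utf8_path& b) {
        return a.p_ == b.p_;
    }

private:
    static std::filesystem::path decode(std::string_view utf8) {
#if defined(__cpp_lib_char8_t)
        return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
        return std::filesystem::u8path(utf8.begin(), utf8.end());
#endif
    }

    std::filesystem::path p_;
};
```

With a type like this, code such as utf8_path("subs") / name can keep passing
UTF-8 std::strings around without caring about the platform's narrow
encoding, and std_path() is the escape hatch for std::filesystem functions.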

So, here we are. It doesn't feel great to have another reinvention of
something that shouldn't be Aegisub's responsibility in the first place,
and we just got rid of all the agi::fs wrapper code, but this seems
like the only sane way to be sure that all conversions happen the way we
expect. I guess since agi::fs wraps std::filesystem and not
boost::filesystem this time, it's still better than before.

Incidentally, std::u8string seems to be kind of a meme too. The idea of
being explicit about your string being UTF-8 is great, but how is there
not even a standard function to reinterpret a string as UTF-8 or
vice-versa?? Let alone support in any other string handling or I/O
functions.

The changeset is pretty big, but the main changes are in fs.h/fs.cpp.
The rest is just a few find-and-replace calls and a handful of manual fixes.

Finally, it should be noted that conversion between
std::filesystem::paths and std::wstrings is broken on gcc <= 11:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95048
This is what currently causes the added lagi_mru.add_entry_utf8 test
to fail on the Ubuntu CI. Clang and newer versions of gcc work, though.
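
A round-trip check of the following kind can be used to probe a given
toolchain for this class of problem. It is only a sketch: the function name
wide_round_trip_ok is hypothetical, it is not the lagi_mru test itself, and
which inputs actually fail on which compiler version should be verified
against the linked bug report.

```cpp
#include <filesystem>
#include <string>

// Hypothetical probe: construct a path from a wide string and convert it
// back. A correct implementation should return the original string.
inline bool wide_round_trip_ok(const std::wstring& wide) {
    const std::filesystem::path p(wide);
    return p.wstring() == wide;
}
```

Plain ASCII input such as L"episode01.ass" is expected to round-trip
everywhere; the interesting inputs are file names with non-ASCII characters.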

Fixes #219.

@arch1t3cht arch1t3cht force-pushed the fix_path_encoding branch 3 times, most recently from 130be96 to 6a4852b Compare December 22, 2024 14:55
This fails on Windows since the switch to std::filesystem,
but would have succeeded with boost::filesystem.
Successfully merging this pull request may close these issues.

opening subtitles will open an empty program: 3.4.0